1 Climate change and temperature anomalies

We are analysing a dataset from NASA’s Goddard Institute for Space Studies to study the effects of climate change in the Northern Hemisphere. A glimpse at the data shows 19 variables and 143 observations, covering the period 1880-2022:

weather <- 
  read_csv("https://data.giss.nasa.gov/gistemp/tabledata_v4/NH.Ts+dSST.csv", 
           skip = 1, 
           na = "***")
glimpse(weather)
## Rows: 143
## Columns: 19
## $ Year  <dbl> 1880, 1881, 1882, 1883, 1884, 1885, 1886, 1887, 1888, 1889, 1890…
## $ Jan   <dbl> -0.36, -0.30, 0.26, -0.58, -0.16, -1.01, -0.75, -1.08, -0.49, -0…
## $ Feb   <dbl> -0.50, -0.21, 0.21, -0.66, -0.11, -0.45, -0.84, -0.71, -0.61, 0.…
## $ Mar   <dbl> -0.23, -0.03, 0.02, -0.15, -0.64, -0.23, -0.71, -0.44, -0.64, -0…
## $ Apr   <dbl> -0.29, 0.01, -0.30, -0.30, -0.59, -0.49, -0.37, -0.38, -0.22, 0.…
## $ May   <dbl> -0.06, 0.05, -0.23, -0.25, -0.36, -0.58, -0.34, -0.26, -0.15, -0…
## $ Jun   <dbl> -0.16, -0.32, -0.28, -0.11, -0.41, -0.45, -0.37, -0.20, -0.03, -…
## $ Jul   <dbl> -0.17, 0.09, -0.28, -0.05, -0.41, -0.33, -0.14, -0.24, 0.00, -0.…
## $ Aug   <dbl> -0.25, -0.03, -0.14, -0.22, -0.51, -0.41, -0.43, -0.55, -0.21, -…
## $ Sep   <dbl> -0.22, -0.26, -0.24, -0.34, -0.45, -0.40, -0.33, -0.21, -0.20, -…
## $ Oct   <dbl> -0.30, -0.43, -0.51, -0.16, -0.44, -0.37, -0.31, -0.49, -0.03, -…
## $ Nov   <dbl> -0.41, -0.36, -0.33, -0.44, -0.57, -0.38, -0.40, -0.27, -0.01, -…
## $ Dec   <dbl> -0.40, -0.23, -0.68, -0.15, -0.47, -0.11, -0.22, -0.43, -0.24, -…
## $ `J-D` <dbl> -0.28, -0.17, -0.21, -0.28, -0.43, -0.43, -0.43, -0.44, -0.24, -…
## $ `D-N` <dbl> NA, -0.18, -0.17, -0.33, -0.40, -0.46, -0.43, -0.42, -0.25, -0.1…
## $ DJF   <dbl> NA, -0.30, 0.08, -0.64, -0.14, -0.64, -0.57, -0.67, -0.51, -0.08…
## $ MAM   <dbl> -0.19, 0.01, -0.17, -0.24, -0.53, -0.43, -0.47, -0.36, -0.34, 0.…
## $ JJA   <dbl> -0.19, -0.09, -0.23, -0.13, -0.44, -0.40, -0.31, -0.33, -0.08, -…
## $ SON   <dbl> -0.31, -0.35, -0.36, -0.31, -0.48, -0.38, -0.35, -0.32, -0.08, -…

For the purpose of our analysis, we select only the columns containing the temperature deviation (delta) for each month, and reshape the dataframe to facilitate further investigation:

tidyweather <- select(weather, 1:13) %>% 
  pivot_longer(!Year, names_to = 'month', values_to = 'delta')
head(tidyweather)
## # A tibble: 6 × 3
##    Year month delta
##   <dbl> <chr> <dbl>
## 1  1880 Jan   -0.36
## 2  1880 Feb   -0.5 
## 3  1880 Mar   -0.23
## 4  1880 Apr   -0.29
## 5  1880 May   -0.06
## 6  1880 Jun   -0.16

1.1 Plotting Information

First, we are plotting a scatter plot to visualize the evolution of delta (temperature deviation) over time:

tidyweather <- tidyweather %>%
  mutate(date = ymd(paste(as.character(Year), month, "1")),
         month = month(date, label=TRUE),
         year = year(date))

ggplot(tidyweather, aes(x=date, y = delta))+
  geom_point()+
  geom_smooth(color="red") +
  theme_bw() +
  labs (
    title = "Weather Anomalies"
  )

Adding a line of best fit to the scatterplot clearly shows that, while deltas were negative before roughly 1935 and close to 0 between approximately 1935 and 1975, temperatures in the Northern Hemisphere have been rising quickly since then. Notice that the rate of increase has itself been accelerating (as signified by the growing deltas).

Next, we will use facet_wrap() to visualize the effects of increasing temperature by month:
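The faceted chart can be produced with a sketch like the following (reusing tidyweather from above; a smoothed trend line is fitted per month with geom_smooth()):

```r
library(ggplot2)

ggplot(tidyweather, aes(x = date, y = delta)) +
  geom_point(size = 0.5) +
  geom_smooth(color = "red") +
  facet_wrap(~month) +   #one panel per calendar month
  theme_bw() +
  labs(title = "Weather Anomalies by Month")
```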

We can see that the effect of rising temperature in the Northern Hemisphere is common to all months of the year.

As a means to further investigate the effects of climate change, we will partition the data into time periods, particularly decades. To that end, we will use case_when():

comparison <- tidyweather %>% 
  filter(Year>= 1881) %>%     #remove years prior to 1881
  #create new variable 'interval', and assign values based on criteria below:
  mutate(interval = case_when(
    Year %in% c(1881:1920) ~ "1881-1920",
    Year %in% c(1921:1950) ~ "1921-1950",
    Year %in% c(1951:1980) ~ "1951-1980",
    Year %in% c(1981:2010) ~ "1981-2010",
    TRUE ~ "2011-present"
  ))

In order to study the effects of climate change by decade, we will produce a density plot to investigate the distribution of monthly temperature deviations by the time periods selected above:

ggplot(comparison) +
  aes(x = delta, fill = interval)+
  #facet_wrap(~month)+
  geom_density(alpha = 0.2) 

The plot clearly shows that, with the passage of time, the deltas have consistently shifted towards the right-hand side of the plot. In other words, temperature deviations have been increasing over time.

Lastly, we will also consider annual anomalies by grouping the monthly data and producing a scatterplot:

#creating yearly averages
average_annual_anomaly <- tidyweather %>% 
  group_by(Year) %>%   #grouping data by Year
  
  # creating summaries for mean delta 
  # use `na.rm=TRUE` to eliminate NA (not available) values 
  summarise(yearly_mean = mean(delta, na.rm=TRUE)) 
  
average_annual_anomaly
## # A tibble: 143 × 2
##     Year yearly_mean
##    <dbl>       <dbl>
##  1  1880      -0.279
##  2  1881      -0.168
##  3  1882      -0.208
##  4  1883      -0.284
##  5  1884      -0.427
##  6  1885      -0.434
##  7  1886      -0.434
##  8  1887      -0.438
##  9  1888      -0.236
## 10  1889      -0.178
## # … with 133 more rows
#plotting the data
#Fit a line of best fit using a linear model
ggplot(average_annual_anomaly) +
  aes(x = Year, y = yearly_mean)+
  geom_point()+
  geom_smooth(method = 'lm') +
  theme_bw()

The plot shows that annual temperature deltas have been increasing over time.

1.2 Confidence Interval for delta

We will now focus on the period from 2011 to the present, turning our attention to a confidence interval for the average annual deltas calculated since 2011. We will use both the statistical (formula-based) method and a bootstrap simulation to have higher confidence in the results:

#statistical method
formula_ci <- comparison %>% 
  filter(interval == '2011-present') %>% 
  group_by(year) %>% 
  summarise(avg_annual_delta = mean(delta),
            sd_delta = sd(delta),
            count = n(),
            SE = sd_delta/sqrt(count),
            upper_ci = avg_annual_delta + 2*SE,
            lower_ci = avg_annual_delta - 2*SE)
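For reference, the ±2·SE bounds are an approximation; with only 12 observations per year, the exact t critical value (as used for the approval ratings in section 2.1) gives slightly wider intervals. A sketch of the same summary using qt():

```r
library(dplyr)

formula_ci_t <- comparison %>% 
  filter(interval == '2011-present') %>% 
  group_by(year) %>% 
  summarise(avg_annual_delta = mean(delta, na.rm = TRUE),
            sd_delta = sd(delta, na.rm = TRUE),
            count = n(),
            SE = sd_delta / sqrt(count),
            t_critical = qt(0.975, count - 1),  #~2.20 for df = 11
            lower_ci = avg_annual_delta - t_critical * SE,
            upper_ci = avg_annual_delta + t_critical * SE)
```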

#bootstrap simulation  
formula_ci_2 <- comparison %>% 
  filter(interval == '2011-present') %>% 
  specify(response = delta) %>% 
  generate(reps = 1000, type = 'bootstrap') %>%  #reps defaults to 1, which collapses the interval
  calculate(stat = 'mean') %>% 
  get_confidence_interval(level = 0.95)

#print out formula_CI
formula_ci
## # A tibble: 12 × 7
##     year avg_annual_delta sd_delta count      SE upper_ci lower_ci
##    <dbl>            <dbl>    <dbl> <int>   <dbl>    <dbl>    <dbl>
##  1  2011            0.745    0.113    12  0.0327    0.810    0.680
##  2  2012            0.813    0.180    12  0.0520    0.917    0.709
##  3  2013            0.8      0.118    12  0.0340    0.868    0.732
##  4  2014            0.919    0.146    12  0.0421    1.00     0.835
##  5  2015            1.18     0.178    12  0.0515    1.28     1.07 
##  6  2016            1.31     0.332    12  0.0958    1.50     1.11 
##  7  2017            1.18     0.227    12  0.0655    1.31     1.05 
##  8  2018            1.03     0.137    12  0.0395    1.11     0.956
##  9  2019            1.21     0.153    12  0.0441    1.30     1.12 
## 10  2020            1.35     0.225    12  0.0648    1.48     1.22 
## 11  2021            1.14     0.115    12  0.0332    1.20     1.07 
## 12  2022           NA       NA        12 NA        NA       NA
formula_ci_2
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     1.04     1.04

Looking at the results of the analysis, we can see that the statistical method produces wider confidence intervals for the annual temperature deltas, ranging from roughly 0.13 to 0.3 in width. This is probably due to the low number of observations (12 months in each year), which prevents a more precise estimate. The bootstrap simulation, by contrast, pools all monthly deltas since 2011 into a single interval, which is much narrower. Both methods nonetheless show that temperature deltas have been positive throughout the period in question and consistently greater than 1 since 2015.

2 Biden’s Approval Margins

In this section, we will analyse the evolution of the approval margins of US President Joe Biden. A glimpse at the dataset shows 22 variables and 4,495 observations:

# Import approval polls data directly off fivethirtyeight website
approval_polllist <- read_csv('https://projects.fivethirtyeight.com/biden-approval-data/approval_polllist.csv') 

glimpse(approval_polllist)
## Rows: 4,495
## Columns: 22
## $ president           <chr> "Joe Biden", "Joe Biden", "Joe Biden", "Joe Biden"…
## $ subgroup            <chr> "All polls", "All polls", "All polls", "All polls"…
## $ modeldate           <chr> "9/9/2022", "9/9/2022", "9/9/2022", "9/9/2022", "9…
## $ startdate           <chr> "1/19/2021", "1/19/2021", "1/20/2021", "1/20/2021"…
## $ enddate             <chr> "1/21/2021", "1/21/2021", "1/21/2021", "1/21/2021"…
## $ pollster            <chr> "Rasmussen Reports/Pulse Opinion Research", "Morni…
## $ grade               <chr> "B", "B", "B", "B+", "B", "B-", "B+", "B-", "B", "…
## $ samplesize          <dbl> 1500, 15000, 1993, 1516, 15000, 1115, 941, 1200, 1…
## $ population          <chr> "lv", "a", "rv", "a", "a", "a", "rv", "rv", "a", "…
## $ weight              <dbl> 0.3382, 0.2594, 0.0930, 1.2454, 0.2333, 1.1014, 1.…
## $ influence           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ approve             <dbl> 48.0, 50.0, 56.0, 45.0, 51.0, 55.5, 63.0, 58.0, 52…
## $ disapprove          <dbl> 45.0, 28.0, 31.0, 28.0, 28.0, 31.6, 37.0, 32.0, 29…
## $ adjusted_approve    <dbl> 49.2, 49.4, 55.4, 46.0, 50.4, 54.6, 59.5, 57.5, 51…
## $ adjusted_disapprove <dbl> 40.3, 30.9, 33.9, 29.0, 30.9, 32.5, 38.4, 32.8, 31…
## $ multiversions       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ tracking            <lgl> TRUE, TRUE, NA, NA, TRUE, NA, NA, NA, TRUE, TRUE, …
## $ url                 <chr> "https://www.rasmussenreports.com/public_content/p…
## $ poll_id             <dbl> 74247, 74272, 74246, 74327, 74273, 74248, 74256, 7…
## $ question_id         <dbl> 139395, 139491, 139394, 139570, 139492, 139404, 13…
## $ createddate         <chr> "1/22/2021", "1/28/2021", "1/22/2021", "2/2/2021",…
## $ timestamp           <chr> "09:48:31  9 Sep 2022", "09:48:31  9 Sep 2022", "0…

2.1 Create a plot

We will first calculate the net approval rate (approve - disapprove) for each week in 2022, along with its 95% confidence interval, and then plot the results as a line plot, grouped by respondent subgroup (Adults, Voters, All polls).

fixed_dates <- approval_polllist %>%
  mutate(date = mdy(enddate),
         weeks = week(date),
         year = year(date),
         net_rate = approve - disapprove) %>%
  filter(year == 2022, weeks<50) %>%
  group_by(subgroup , weeks) %>%
  
  # we calculated the 95% confidence interval. 
  summarise(mean_rate = mean(net_rate,na.rm=TRUE),
            sd_rate = sd(net_rate,na.rm=TRUE),
            number = n(),
            t_critical = qt(0.975,number-1),
            lower_bound = mean_rate - t_critical*sd_rate/ sqrt(number),
            upper_bound = mean_rate + t_critical*sd_rate/ sqrt(number)) 

# we draw the graph of the net approval rate changing over weeks with its confidence interval.
  ggplot(fixed_dates, aes(x = weeks, y = mean_rate, color = subgroup)) + 
  geom_ribbon(aes(ymin = lower_bound, ymax = upper_bound), 
              fill = "orange", 
              alpha = 0.25, 
              show.legend = FALSE) +
    facet_grid(rows = vars(subgroup)) +
    geom_line(aes(x = weeks, y = mean_rate, 
                  color = subgroup),
              show.legend = FALSE) +
    labs(title = "Biden's net approval ratings in 2022",
         subtitle = "Weekly Data, Approve - Disapprove, %",
         caption = "Source: https://projects.fivethirtyeight.com/biden-approval-data/",
         x = NULL, 
         y = NULL) +
    theme_minimal()

We can see that President Biden’s net approval rating has remained negative since the first week of 2022 among all poll respondents, meaning more people disapprove than approve of the President. Furthermore, we notice a sharp decline in the net approval rate beginning in week 23. Since then, the rating appears to have returned to pre-drop levels, potentially due to the President’s recent delivery on several campaign promises, most notably the passage of the Inflation Reduction Act.

3 Challenge 1: Excess rentals in TfL bike sharing

This section focuses on analysing rental data from the TfL bike-sharing scheme.

url <- "https://data.london.gov.uk/download/number-bicycle-hires/ac29363e-e0cb-47cc-a97a-e216d900a6b0/tfl-daily-cycle-hires.xlsx"

# Download TFL data to temporary file
httr::GET(url, write_disk(bike.temp <- tempfile(fileext = ".xlsx")))
## Response [https://airdrive-secure.s3-eu-west-1.amazonaws.com/london/dataset/number-bicycle-hires/2022-09-06T12%3A41%3A48/tfl-daily-cycle-hires.xlsx?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJJDIMAIVZJDICKHA%2F20220910%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20220910T183325Z&X-Amz-Expires=300&X-Amz-Signature=08af6c1aef1cc7da37d0fdc7da3ca949e7f7d76703bb4b2a7efd5ed7bd6ca6d5&X-Amz-SignedHeaders=host]
##   Date: 2022-09-10 18:35
##   Status: 200
##   Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
##   Size: 180 kB
## <ON DISK>  /var/folders/1j/4n8_k36148vb884h5ph4r83h0000gn/T//RtmpRbiTwY/file7adb7c015a0.xlsx
# Use read_excel to read it as dataframe
bike0 <- read_excel(bike.temp,
                   sheet = "Data",
                   range = cell_cols("A:B"))

# change dates to get year, month, and week
bike <- bike0 %>% 
  clean_names() %>% 
  rename (bikes_hired = number_of_bicycle_hires) %>% 
  mutate (year = as.integer(year(day)),
          month = lubridate::month(day, label = TRUE),
          week = isoweek(day))

glimpse(bike)
## Rows: 4,416
## Columns: 5
## $ day         <dttm> 2010-07-30, 2010-07-31, 2010-08-01, 2010-08-02, 2010-08-0…
## $ bikes_hired <dbl> 6897, 5564, 4303, 6642, 7966, 7893, 8724, 9797, 6631, 7864…
## $ year        <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010…
## $ month       <ord> Jul, Jul, Aug, Aug, Aug, Aug, Aug, Aug, Aug, Aug, Aug, Aug…
## $ week        <dbl> 30, 30, 30, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32…

First, we calculate the monthly change in TfL bike rentals, defined as the difference between the actual monthly average and the historical average for 2016-2019. We plot the data faceted by year, using geom_ribbon() to visualize the positive/negative deltas.

#calculating expected number of rentals
compare <- bike %>%
  filter(year %in% c(2016:2019)) %>% 
  group_by(month) %>% 
  summarise(compare_avg = mean(bikes_hired))

#calculating monthly averages
avg <- bike %>% 
  filter(year %in% 2017:2022) %>% 
  group_by(year, month) %>% 
  summarise(actual_avg = mean(bikes_hired))
#joining datasets
left_join(avg, compare, by = 'month') %>%
  #calculating differences
  mutate(difference = actual_avg - compare_avg, 
         pos_diff = ifelse(difference > 0, actual_avg, 0),
         neg_diff = ifelse(difference < 0, compare_avg, 0)) %>%
  #plotting
  ggplot(aes(x = month)) +
  geom_line(aes(y = compare_avg, group = 1), color = "blue", lwd = 1.5) +
  geom_line(aes(y = actual_avg, group = 1)) +
  #constant aesthetics belong outside aes(); green = above average, red = below
  geom_ribbon(aes(ymin = compare_avg, ymax = pmax(0, difference) + compare_avg, group = 1), fill = "green", alpha = 0.5) +
  geom_ribbon(aes(ymin = pmin(0, difference) + compare_avg, ymax = compare_avg, group = 1), fill = "red", alpha = 0.5) +
  facet_wrap(vars(year)) +
  labs(title = "Monthly changes in TfL bike rentals",
       subtitle = "Blue line shows the monthly average calculated between 2016-2019",
       caption = "Source: TfL, London Data Store",
       x = NULL,
       y = "Bike rentals") +
  theme(legend.position = "none")

We can see that TfL bike rentals fell below the 2016-2019 average at the beginning of the pandemic, but quickly recovered and went on to exceed historical levels. Interestingly, there has been a significant uptick starting in the second half of 2021, possibly due to changing preferences regarding means of transport, with public transport losing users to TfL bikes.

Next, we plot a similar graph to visualize weekly changes in TfL bike rentals relative to the 2016-2019 weekly average.

#calculating expected number of rentals
compare <- bike %>%
  filter(year %in% c(2016:2019)) %>% 
  group_by(week) %>% 
  summarise(compare_avg = mean(bikes_hired))

#calculating weekly averages
avg <- bike %>% 
  filter(year %in% 2017:2022) %>% 
  group_by(year, week) %>% 
  summarise(actual_avg = mean(bikes_hired))

#deleting aberrant observations (average for future weeks in 2022)
avg <- avg[-298,]
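Dropping a row by its position is fragile if the source data updates; a more robust sketch (assuming the aberrant row is the current, still-incomplete week):

```r
library(dplyr)
library(lubridate)

#drop the still-incomplete current week instead of a hard-coded row index
avg <- avg %>% 
  filter(!(year == year(today()) & week == isoweek(today())))
```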

#joining dataframes
left_join(avg, compare, by = 'week') %>%
  #calculating differences
  mutate(diff = (actual_avg - compare_avg)/compare_avg,
         pos_diff = ifelse(diff > 0, diff, 0),
         neg_diff = ifelse(diff < 0, diff, 0)) %>%
  #plotting
  ggplot(aes(x = week, y = diff)) +
  scale_x_continuous(breaks = c(13, 26, 39, 53)) +
  scale_y_continuous(labels = percent) +
  geom_rect(aes(xmin = 13, xmax = 26, ymin = -Inf, ymax = Inf), alpha = 0.3, fill = "grey90") +
  geom_rect(aes(xmin = 39, xmax = 53, ymin = -Inf, ymax = Inf), alpha = 0.3, fill = "grey90") +
  geom_line(aes(y = diff, group = 1), color = 'black', lwd = 0.8) +
  geom_ribbon(aes(ymin = 0, ymax = pmax(0, pos_diff)), fill = 'green', alpha = 0.3) +
  geom_ribbon(aes(ymin = pmin(0, neg_diff), ymax = 0), fill = 'red', alpha = 0.3) +
  geom_rug(aes(colour = diff), 
           sides = 'b', 
           length = unit(0.02, "npc"), 
           size = 1, 
           show.legend = FALSE) +
  binned_scale(aesthetics = "colour",
               scale_name = "stepsn",
               palette = function(x) c("red", "green"),
               breaks = c(0, 100)) +
  facet_wrap(vars(year)) +
  theme_minimal() +
  labs(title = "Weekly changes in TfL bike rentals",
       subtitle = "% change from weekly averages calculated between 2016-2019",
       caption = "Source: TfL, London Data Store",
       x = "Week",
       y = NULL)

Again, one can easily notice the drops at the beginning of 2020 (the start of the pandemic) and in the winter of 2021 (a COVID wave), as well as the sizable increase in TfL rentals since the second half of 2021.

It should be noted that the mean has been used to calculate the expected number of bike rentals for each month/week because the data are approximately normally distributed, as seen in the histogram below. Otherwise, the median would have been preferable, as it is a more robust measure of central tendency.

hist(bike$bikes_hired)

4 Challenge 2: Share of renewable energy production in the world

This last section focuses on analysing the share of renewable energy production around the world. We will be using datasets from the National Bureau of Economic Research (NBER) and the World Bank.

The following is a description of the variables from the NBER dataset:

variable   class      description
variable   character  Variable name
label      character  Label for variable
iso3c      character  Country code
year       double     Year
group      character  Group (consumption/production)
category   character  Category
value      double     Value (related to label)

technology <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-07-19/technology.csv')

#get all technologies
labels <- technology %>% 
  distinct(variable, label)

# Get country names using 'countrycode' package
technology <- technology %>% 
  filter(iso3c != "XCD") %>% 
  mutate(iso3c = recode(iso3c, "ROM" = "ROU"),
         country = countrycode(iso3c, origin = "iso3c", destination = "country.name"),
         country = case_when(
           iso3c == "ANT" ~ "Netherlands Antilles",
           iso3c == "CSK" ~ "Czechoslovakia",
           iso3c == "XKX" ~ "Kosovo",
           TRUE           ~ country))

#make smaller dataframe on energy
energy <- technology %>% 
  filter(category == "Energy")

# download CO2 per capita from World Bank using {wbstats} package
# https://data.worldbank.org/indicator/EN.ATM.CO2E.PC
co2_percap <- wb_data(country = "countries_only", 
                      indicator = "EN.ATM.CO2E.PC", 
                      start_date = 1970, 
                      end_date = 2022,
                      return_wide=FALSE) %>% 
  filter(!is.na(value)) %>% 
  #drop unwanted variables
  select(-c(unit, obs_status, footnote, last_updated))

# get a list of countries and their characteristics
# we just want to get the region a country is in and its income level
countries <-  wb_cachelist$countries %>% 
  select(iso3c,region,income_level)

First, we produce a graph of the countries with the highest and lowest % contribution of renewables to energy production.

#manipulating the dataset
data <- energy %>% 
  filter(year == 2019) %>% 
  select(country, variable, value) %>% 
  pivot_wider(names_from = "variable", values_from = "value") %>% 
  arrange(country) %>% 
  mutate(renew_share = (elec_solar + elec_hydro + elec_wind + elec_renew_other)/elecprod,
         renew_rounded = round(renew_share, digits = 4)) %>% 
  drop_na(renew_rounded) %>% 
  filter(renew_rounded > 0) %>% 
  arrange(desc(renew_rounded)) %>% 
  select(country, renew_rounded)

#selecting top20 observations
max <- data %>%
  slice_max(n = 20, order_by = renew_rounded)

#plotting  
max_plot <- ggplot(max, aes(x = renew_rounded, y = reorder(country, renew_rounded))) +
            geom_col() +
            scale_x_continuous(labels = percent) +
            labs(x = NULL, y = NULL)

#selecting min20 observations
min <- data %>% 
  slice_min(n = 20, order_by = renew_rounded)

#plotting
min_plot <- ggplot(min, aes(x = renew_rounded, y = reorder(country, renew_rounded))) +
            geom_col() +
            scale_x_continuous(labels = percent) +
            labs(x = NULL, y = NULL)

library(patchwork)
full_plot <- max_plot + min_plot 

full_plot + plot_annotation(
  title = "Highest and lowest % of renewables in energy production",
  subtitle = "2019 data",
  caption = "NBER CHAT Database")

We can see that the countries with the highest share of renewables in their energy mix are not necessarily the richest nations; eight of them get 100% of their electricity from renewable sources. Conversely, those with the lowest % of renewables include some of the wealthiest states in the world (Kuwait, Qatar), which is likely due to their access to vast amounts of oil.

Second, we can produce an animation to explore the relationship between CO2 per capita emissions and the deployment of renewables over time, faceted by income group.

#manipulating the data to facilitate time-series analysis
renewables <- energy %>% 
  select(iso3c, year, country, variable, value) %>% 
  pivot_wider(names_from = "variable", values_from = "value") %>% 
  arrange(country) %>% 
  mutate(renew_share = (elec_solar + elec_hydro + elec_wind + elec_renew_other)/elecprod,
         renew_rounded = round(renew_share, digits = 4)) %>% 
  drop_na(renew_rounded) %>% 
  filter(renew_rounded > 0) %>% 
  select(iso3c, year, country, renew_rounded)

#deleting aberrant observations (share of renewables > 100%)
renewables <- renewables[-c(1:3), ]
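Again, positional indexing is brittle; since the aberrant observations are defined by a renewables share above 100%, filtering on that condition states the intent directly (a sketch):

```r
library(dplyr)

#keep only shares that are logically possible (<= 100%)
renewables <- renewables %>% 
  filter(renew_rounded <= 1)
```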

#joining datasets to create a single dataframe
joined_first <- left_join(renewables, co2_percap, by=c("iso3c" = "iso3c", "year" = "date"))
full_data <- left_join(joined_first, countries, by = "iso3c")

#filtering the dataset
for_plots <- full_data %>% 
  mutate(year = as.integer(year)) %>% 
  select(year, country.x, renew_rounded, value, income_level) %>%
  drop_na(income_level, value) %>% 
  filter(year %in% 1990:2020)

#plotting (the animation layers require the gganimate package)
library(gganimate)

plot <- for_plots %>% 
  ggplot(aes(x = renew_rounded, y = value, color = income_level)) +
  geom_point() +
  ylim(0, 50) +
  facet_wrap(vars(income_level)) +
  labs(title = 'Year: {frame_time}', 
       x = '% renewables', 
       y = 'CO2 per cap') +
  transition_time(year) +
  ease_aes('linear') +
  theme(legend.position = "none") +
  scale_x_continuous(labels = percent)

for_plots
## # A tibble: 3,851 × 5
##     year country.x   renew_rounded  value income_level
##    <int> <chr>               <dbl>  <dbl> <chr>       
##  1  2003 Afghanistan         0.671 0.0515 Low income  
##  2  2004 Afghanistan         0.632 0.0417 Low income  
##  3  2005 Afghanistan         0.632 0.0604 Low income  
##  4  2006 Afghanistan         0.761 0.0666 Low income  
##  5  2007 Afghanistan         0.791 0.0653 Low income  
##  6  2008 Afghanistan         0.744 0.128  Low income  
##  7  2009 Afghanistan         0.827 0.172  Low income  
##  8  2010 Afghanistan         0.802 0.244  Low income  
##  9  2011 Afghanistan         0.703 0.297  Low income  
## 10  2012 Afghanistan         0.809 0.259  Low income  
## # … with 3,841 more rows
plot

The animation shows that as the % of energy generated by renewables goes up, CO2 per capita emissions do not seem to go down. Such a relationship would appear as dots moving diagonally from the upper-left to the lower-right corner of the plot, but no such movement is detected. In fact, we can observe that quite a few countries move horizontally to the right, meaning their CO2 per capita emissions stay constant as the share of renewables in their energy mix increases.

5 Details

  • Who did you collaborate with: Samarth Sharma, Vivian van Oosten, Anastasia Fu, Jaelyn Shi, Andrew Robak, Shivant Maharaj
  • Approximately how much time did you spend on this problem set: 10h
  • What, if anything, gave you the most trouble: shading the two quarters in challenge 1